Over its roughly 1.5 years so far, the COVID-19 pandemic has radically changed daily life and caused many casualties worldwide. To help prepare for similar crises in the future, this report investigates indicators of a country's response to COVID-19, stratified by political, social and economic factors.
We find that most countries had similar experiences with COVID-19, differing mainly in magnitude, while a select few handled the pandemic exceptionally poorly. Countries' behaviour in terms of stringency is associated with their broad geographic region, potentially because cases spread across borders or because neighbouring governments responded similarly. The political, social and economic (PSE) implications of the clusters are inconclusive without consulting experts, because PSE values vary widely within each cluster.
The analysis aims to help social and political scientists understand governments' responses to COVID-19 and how those responses differ between countries within the same cluster. It can supplement their research by identifying trends in groups of PSE factors.
This project aims to uncover insights about the relationship between new cases or government response and various Political, Social and Economic (PSE) factors. The motivation is to understand how PSE factors shape countries' responses to the COVID-19 pandemic, and therefore how these countries may respond to future crises. Combining epidemiology, politics, sociology and economics lets us treat the COVID-19 crisis not only as a disease but as an overarching global phenomenon that helps us understand the behaviour of different countries and anticipate future crises.
The innovation of this project lies in exploring the intuitive reasons behind how countries respond to the COVID-19 pandemic in terms of PSE factors. Results are visualised as SOM diagrams, world maps, bubble charts and interactive maps, so that a variety of easily consumable visualisations is available for interpretation.
The target audience for this project is social and political scientists in industry and academia, including social epidemiologists, the UN, the World Bank, the Institute for Health Metrics and Evaluation (IHME), and the Sydney Social Sciences and Humanities Advanced Research Centre.
The epidemiology, demographics and economy data in this project are from the COVID-19 Open Data GitHub repository [1], collated by Google Cloud Platform from reputable sources including DataCommons, Eurostat, the World Bank, Our World in Data and the WHO.
The political system data is from the ‘List of countries by system of government’ Wikipedia page [2], collated from Bertelsmann Transformation Index 2012 [3] and Historical Atlas of the Twentieth Century [4].
The data is credible because it comes from trustworthy sources, including world-renowned databases such as the World Bank, Our World in Data and the WHO.
The data is valid for our purposes because it is highly relevant to the project: it comprises up-to-date time series of COVID-19 data together with data on the PSE factors of interest.
Datasets are merged by joining on country code or name (see Appendix #1), followed by data cleaning (see Appendix #2).
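The join described above can be sketched in base R as follows. This is a minimal illustration only: the toy tables and the column names `country_code`, `new_confirmed` and `constitutional_form` are assumptions, and the actual merging and cleaning code is the one in Appendix #1 and Appendix #2.

```r
# Hypothetical toy tables; column names and values are for illustration only.
epi <- data.frame(
  country_code = c("AU", "NZ", "US"),
  new_confirmed = c(12, 3, 45000),
  stringsAsFactors = FALSE
)
politics <- data.frame(
  country_code = c("AU", "NZ", "FR"),
  constitutional_form = c("Constitutional monarchy",
                          "Constitutional monarchy",
                          "Republic"),
  stringsAsFactors = FALSE
)

# Left join: keep every country in the epidemiology table and
# fill political fields with NA where no match is found.
merged <- merge(epi, politics, by = "country_code", all.x = TRUE)

# A simple cleaning check: list the countries that failed to match,
# so their codes or names can be reconciled by hand.
unmatched <- merged$country_code[is.na(merged$constitutional_form)]
print(merged)
print(unmatched)
```

Listing unmatched keys after a left join is a cheap way to spot inconsistent country names or codes between sources before any rows are dropped.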
The clusters formed based on either new confirmed cases or stringency index are visualised as two world maps drawn using ggplotly. Countries in each cluster or subgroup are distinguished by colour.
An interactive map was created to simplify the viewing experience, by providing users with an overview of the relevant data sources (including new cases, stringency index and cluster groups) from a global perspective. The interactive element comes from the ability to pan and zoom, select different map types, choose to show cluster groups, and view the global change in data over time (through a play button) as the pandemic progresses. In addition, users may hover over specific countries and view detailed information including the political system, population, cases, stringency index and cluster group.
We settled on the bubble chart [Figure 1] because we wanted to represent political, social and economic information in a single, concise visualisation, so that readers can draw conclusions from the location, size and colour of each bubble. For the 'New Confirmed Cases' cluster graph, we look at maximum stringency, percentage of infected population, constitutional form and GDP per capita. For the 'Stringency Index' cluster graph, we look at percentage of deceased population, percentage of infected population, constitutional form and GDP per capita. In both cases, we wanted to determine whether there is any relationship between a country's form of government and economic status and how it handled the COVID-19 pandemic.
library(dplyr)    # filter(), group_by(), mutate(), arrange(), %>%
library(ggplot2)  # ggplot() and scales

cov_clusters <- read.csv("data/merged_covid_clusters.csv")

# Bubble chart for cluster 1 of the 'New Confirmed Cases' clustering
pse_graph <- cov_clusters %>%
  filter(value_col == "new_confirmed") %>%
  filter(som_cluster == 1) %>%
  group_by(country_name) %>%
  # Per-country peaks; perc_infected_pop is a proportion between 0 and 1
  mutate(max_total = max(total_confirmed),
         max_stringency = max(stringency_index),
         perc_infected_pop = max_total / population) %>%
  # Draw large bubbles first so that smaller ones are not hidden
  arrange(desc(gdp_per_capita)) %>%
  ggplot(aes(country_name = country_name, x = perc_infected_pop, y = max_stringency,
             size = gdp_per_capita, color = constitutional_form)) +
  geom_point(alpha = 1) +
  scale_size(range = c(.5, 12), name = "GDP Per Capita") +
  scale_color_brewer(palette = "Paired") +
  xlab("Percentage of Infected Population") +
  ylab("Maximum Stringency") +
  labs(color = "Constitutional Form") +
  scale_x_continuous(limits = c(0, 0.2)) +
  scale_y_continuous(limits = c(0, 100)) +
  ggtitle("Figure 1: PSE Factor Bubble Chart")
pse_graph
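The derived quantities plotted in Figure 1 can be illustrated on a small table. The sketch below is base R on made-up numbers, assuming one row per country per date; it collapses the data to one row per country, which is one possible way to avoid plotting the same bubble several times and is not necessarily how the chart above was built.

```r
# Hypothetical long-format data: one row per country per date.
toy <- data.frame(
  country_name = c("A", "A", "B", "B"),
  total_confirmed = c(100, 250, 10, 40),
  stringency_index = c(60, 80, 30, 20),
  population = c(1000, 1000, 2000, 2000),
  stringsAsFactors = FALSE
)

# Collapse to one row per country: peak cumulative cases,
# peak stringency, and population.
per_country <- aggregate(
  cbind(total_confirmed, stringency_index, population) ~ country_name,
  data = toy, FUN = max
)
names(per_country) <- c("country_name", "max_total",
                        "max_stringency", "population")

# Proportion of the population infected (multiply by 100 for a percentage).
per_country$perc_infected_pop <- per_country$max_total / per_country$population
print(per_country)
```

Note that the value is a proportion, which is consistent with the x-axis limit of 0.2 used in the chart (that is, 20% of the population).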
In the preliminary exploratory stage, we needed to find the optimal number of clusters before we could start clustering the countries. We considered popular methods such as the gap statistic and the elbow method, and settled on the Silhouette method. The Silhouette method computes the average silhouette of the observations for different values of k; the optimal number of clusters is the k that maximises the average silhouette over the range of values considered. We applied it to our initial clustering algorithm, hierarchical clustering, which gave an optimal number of 6 clusters. This agreed with the number of clusters suggested by our self-organising maps. Refer to Appendix #3 for the graph of the Silhouette method.
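The Silhouette procedure can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the code behind Appendix #3: the toy points, the range of k and the default (complete-linkage) `hclust` settings are all chosen for the example, and `silhouette()` comes from the `cluster` package.

```r
library(cluster)  # silhouette()

set.seed(1)
# Hypothetical toy data: three well-separated groups of 2-D points.
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

d <- dist(toy)      # pairwise Euclidean distances
tree <- hclust(d)   # hierarchical clustering

# Average silhouette width for each candidate number of clusters k.
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  labels <- cutree(tree, k)
  mean(silhouette(labels, d)[, "sil_width"])
})

# The chosen k is the one with the largest average silhouette.
best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_silhouette = round(avg_sil, 3)))
print(best_k)
```

On real country time series the maximum is usually less clear-cut than on this toy data, which is why it helps to compare the result with a second method, as was done here with the SOM.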
To generate the clusters of similar countries, we use Self-Organising Maps (SOMs), which are a form of artificial neural network. A SOM takes the dataset and maps each data point to a node based on similarity, with adjacent nodes being more similar to each other than nodes further apart.
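The mapping step can be illustrated with a tiny hand-made example. The sketch below only shows how an observation is assigned to its closest node (its best matching unit) for a fixed, hypothetical codebook; a real SOM also learns the codebook during training by repeatedly pulling the winning node and its grid neighbours towards each observation, which is not shown here.

```r
# Hypothetical codebook: 3 nodes, each with a prototype vector of length 2.
codebook <- rbind(
  node1 = c(0, 0),
  node2 = c(5, 5),
  node3 = c(10, 0)
)

# Hypothetical observations to map onto the nodes.
obs <- rbind(
  a = c(0.5, -0.2),
  b = c(4.6, 5.3),
  c = c(9.1, 0.4)
)

# Return the index of the node whose prototype has the smallest
# squared Euclidean distance to the observation x.
best_matching_unit <- function(x, codes) {
  which.min(rowSums(sweep(codes, 2, x)^2))
}

bmu <- apply(obs, 1, best_matching_unit, codes = codebook)
print(bmu)
```

In the country analysis each observation would be a country's whole time series rather than a 2-D point, but the assignment rule is the same.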
SOMs were picked over other clustering methods such as BIRCH and k-means clustering because they inherently reduce the dimensionality of the dataset to two dimensions, whereas a custom distance measure needs to be defined for the other methods. SOMs also have built-in cluster optimisation, with set rules for how large or small to make the grid. This reduces the multiple testing involved in trying different numbers of clusters with the other methods, which would have increased the chance of seeing a good result by luck.
The SOM model itself is a 4x4 grid with hexagonal topology, because we expect moderately high similarity across countries. Using the distances generated by the model, we can consolidate the nodes into final clusters, which are visualised below [Figure 2]. Because the plot is small, we label only a subset of the countries.
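library(reshape2)  # dcast() (assumed to come from reshape2)
library(kohonen)   # som(), somgrid(), add.cluster.boundaries()
# Read in COVID data
cov_data <- read.csv("data/data_smoothed_standardised.csv")
cov_data$date <- as.Date(cov_data$date)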
# SOMs require the data to be in a matrix form, so we need to reshape the data with `dcast`.
VALUE_COL = "new_confirmed_smooth" # Name of the column to cluster on
# Filter for the value col and turn into a matrix
data2 <- cov_data[c("date", "country_name", VALUE_COL)]
country_matrix <- dcast(data2, country_name ~ date, value.var=VALUE_COL)
country_matrix[is.na(country_matrix)] <- 0
rownames(country_matrix) <- country_matrix$country_name
country_matrix <- subset(country_matrix, select=-country_name)
## Code referencing: https://clarkdatalabs.github.io/soms/SOM_Shakespeare_Part_1
## and https://www.r-bloggers.com/2014/02/self-organising-maps-for-customer-segmentation-using-r/
som_model <- som(
as.matrix(country_matrix),
grid=somgrid(xdim = 4, ydim=4,
topo="hexagonal"),
rlen=100,
alpha=c(0.05,0.01),
keep.data = TRUE)
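# Cluster the node codebook vectors hierarchically, cut the tree into
# 6 clusters, and pair each country with the node it was mapped to
som_cluster <- cutree(hclust(dist(som_model$codes[[1]])), 6)
country_clusters <- data.frame(
  rownames(country_matrix), som_model$unit.classif
)
# Map nodes to cluster numbers: node names look like "V1", "V2", ...,
# so drop the leading letter to recover the node index
c_mappings <- data.frame(som_cluster)
c_mappings$node_name <- as.integer(substring(rownames(c_mappings), 2))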
c6_clusters <- merge(country_clusters, c_mappings,
by.x="som_model.unit.classif",
by.y="node_name")
colnames(c6_clusters) <- c("som_node", "country_name", "som_cluster")
# Extract original country name with spaces
c6_clusters$country_name <- gsub("\\.", " ",
c6_clusters$country_name)
# Define some countries to plot
to_plot <- c("United States of America",
"Italy", "Iran", "France", "Australia", "New Zealand",
"India", "Brazil", "China", "South Korea", "Japan", "United Kingdom", "Russia", "Sweden")
# Label only the selected countries; every other country is shown as "."
X <- as.vector(sort(unique(c6_clusters$country_name)))
filtered_labels <- ifelse(X %in% to_plot, X, ".")
plot(som_model, type = "mapping", main = "Figure 2: SOM Clusters",
     bgcol = rainbow(6)[som_cluster],
     cex = 0.8,
     labels = filtered_labels)
add.cluster.boundaries(som_model, som_cluster)
Each cluster is coloured differently, with the dark lines marking the borders of each cluster. Here, the USA and the UK fall in the same cluster, indicating similar behaviour.
This step was repeated for each value to cluster on (new cases in various forms, and stringency index). A function automating this is in Appendix #4.
Given the nature of the pandemic, we have a strong prior for how certain countries have performed, and we use it to sanity-check that our clusters are valid. For example, we know anecdotally that New Zealand kept cases very low throughout the pandemic, whereas cases in the USA spiralled out of control, so we expect to see these patterns reflected in the clusters.
To do this, we plot new cases or stringency over time for all countries in the same cluster, and visually inspect the plots to ensure that the countries within a cluster are similar to each other and reasonably distinct from those in other clusters.
# Read in cached clusters for performance
cluster6_cache <- read.csv("data/c6_country_clusters_cache.csv")
cluster6_cache <- cluster6_cache[cluster6_cache$value_col %in% c("new_confirmed_smooth", "stringency_index"), c('country_name', 'som_cluster', 'value_col')]
cov_data_subset <- cov_data %>% select(date, country_name, new_confirmed_smooth, stringency_index)
cluster_df <- merge(x=cov_data_subset,
y=cluster6_cache, by="country_name")
Examining the clusters on smoothed cases, we find the following 6 clusters of countries [Figure 3]. Key observations to note:
# One panel per cluster, one line per country
clusters_cases_plot <- ggplot(
    cluster_df %>% filter(value_col == "new_confirmed_smooth"),
    # aes() with bare column names replaces the deprecated aes_string()
    aes(x = date, y = new_confirmed_smooth,
        group = country_name, color = country_name)) +
  geom_line(lwd = 1) +
  theme_bw() +
  ylab("Number of new cases") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(color = "Country/Region") +
  xlab("") +
  facet_wrap(~ som_cluster) +
  ggtitle("Figure 3: Clusters on New Cases")
plotly::ggplotly(clusters_cases_plot)